
Fix moderation batching #191

Merged · 1 commit · Oct 18, 2023
Conversation

@mruwnik (Collaborator) commented Oct 15, 2023

No description provided.

:param Callable[[str], int] tokens_counter: the function used to count tokens
"""
# A very ugly loop that will split the `texts` into smaller batches so that the
# total sum of tokens in each batch will not exceed `max_batch_size`
@mruwnik (Collaborator, Author):
As the comment says, this is very ugly. I couldn't think of anything better that wouldn't be a lot more complicated.

A reviewer replied:
no idea what part of this loop is ugly, but we have https://github.com/StampyAI/stampy/blob/853e28b2e002f50d5861583cf09254093fd4e397/utilities/utilities.py#L307 in stampy bot if you want some inspiration (but this one looks better TBH) :D

@mruwnik (Collaborator, Author) replied:
Yup, both are terrible :P I wanted something that would do it in, say, around 2-3 lines. This offends my sensibilities, but it's better than any alternatives :/
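The loop under discussion isn't shown in full in this thread. A minimal greedy sketch of the idea it describes (splitting `texts` into batches whose summed token counts stay within `max_batch_size`) might look like the following; the helper name `batch_texts` is hypothetical, not the PR's actual code:

```python
from typing import Callable, List


def batch_texts(
    texts: List[str],
    max_batch_size: int = 4096,
    tokens_counter: Callable[[str], int] = len,
) -> List[List[str]]:
    """Greedily split `texts` into batches whose total token count,
    as measured by `tokens_counter`, does not exceed `max_batch_size`."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for text in texts:
        tokens = tokens_counter(text)
        # Start a new batch if adding this text would exceed the budget
        if current and current_tokens + tokens > max_batch_size:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

A single text larger than `max_batch_size` still gets its own batch here rather than being dropped, which matches the greedy "flush and continue" shape being debated above.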

@@ -93,11 +93,31 @@ def _single_batch_moderation_check(batch: List[str]) -> List[ModerationInfoType]
return openai.Moderation.create(input=batch)["results"]


-def moderation_check(texts: List[str], max_texts_num: int = 32) -> List[ModerationInfoType]:
-    """Batch moderation checks on list of texts."""
+def moderation_check(texts: List[str], max_batch_size: int = 4096, tokens_counter: Callable[[str], int] = len) -> List[ModerationInfoType]:
@mruwnik (Collaborator, Author):
This uses `len` to calculate token usage, which will overestimate by something like 2-3 times. It's also used elsewhere here, from what I can see, so I left it for now.
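Since the comment notes that `len` (character count) overestimates token usage by roughly 2-3x, a more faithful `tokens_counter` could use a real tokenizer where available. A hedged sketch, assuming OpenAI's optional `tiktoken` package and falling back to the common "about 4 characters per token" heuristic for English text:

```python
from typing import Callable


def make_token_counter() -> Callable[[str], int]:
    """Return a token-counting function.

    Uses tiktoken's cl100k_base encoding if the package is installed;
    otherwise falls back to a rough chars-divided-by-4 heuristic.
    """
    try:
        import tiktoken  # optional dependency; OpenAI's tokenizer

        enc = tiktoken.get_encoding("cl100k_base")
        return lambda text: len(enc.encode(text))
    except ImportError:
        # Crude approximation: English averages ~4 characters per token
        return lambda text: max(1, len(text) // 4)
```

Either counter plugs directly into the new `tokens_counter` parameter; the fallback at least avoids the 2-3x overestimate that plain `len` produces.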

@@ -1,5 +1,5 @@
import logging
-from typing import List, Tuple, Dict, Any, Optional
+from typing import List, Tuple, Dict, Any, Optional, Callable

A reviewer:

I'm starting to reconsider the wisdom against wild imports, for the specific case of from typing import * it might not be as evil as in the general case 🤔

-def moderation_check(texts: List[str], max_texts_num: int = 32) -> List[ModerationInfoType]:
-    """Batch moderation checks on list of texts."""
+def moderation_check(texts: List[str], max_batch_size: int = 4096, tokens_counter: Callable[[str], int] = len) -> List[ModerationInfoType]:
+    """Batch moderation checks on list of texts.

A reviewer:

I would prefer to document the part that explains what a moderation check is, not the part that explains that code with a `batch` variable is doing some batching 😅

@mruwnik mruwnik merged commit 714b252 into main Oct 18, 2023
2 of 3 checks passed
@mruwnik mruwnik deleted the fix-moderation-issue branch October 18, 2023 14:55